Minimizing write operation for multi-dimensional DSP applications via a two-level partition technique with complete memory latency hiding
Abstract
Most scientific and digital signal processing (DSP) applications are recursive or iterative. Executing these applications on a chip multiprocessor (CMP) raises two challenges. First, because most DSP applications are both computation-intensive and data-intensive, an inefficient scheduling scheme may generate a large number of write operations, which cost considerable time and consume a significant amount of energy. Second, because CPU speed has increased dramatically relative to memory speed, slow memory limits overall system performance. In this paper, we develop a Two-Level Partition (TLP) algorithm that minimizes write operations while achieving full parallelism for multi-dimensional DSP applications running on CMPs that employ scratchpad memory (SPM) as on-chip memory (e.g., the IBM Cell processor). Experiments on DSP benchmarks demonstrate the effectiveness and efficiency of the TLP algorithm: it completely hides memory latencies to achieve full parallelism and generates the fewest write operations to main memory among the approaches compared. The experimental results show that the proposed algorithm outperforms all known methods, including list scheduling, rotation scheduling, Partition Scheduling with Prefetching (PSP), and Iterational Retiming with Partitioning (IRP). Compared with IRP, the best previously known algorithm, TLP reduces write operations to main memory by 45.35% and the schedule length by 23.7% on average.
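To make the idea of "hiding memory latency" concrete, the following is a minimal sketch of a simplified timing model. It is not the TLP algorithm from the paper. It assumes, for illustration only, that the iteration space is split into equal partitions, that each partition needs a fixed compute time and a fixed memory-transfer time, and that a memory unit can load the next partition into SPM while the cores compute the current one. The function name and parameters are hypothetical.

```python
def schedule_length(n_partitions, compute_time, memory_time, overlap):
    """Total time to process all partitions under a simplified model.

    Assumption (not from the paper): every partition has the same
    compute_time and memory_time, and stages are synchronized.
    """
    if n_partitions <= 0:
        return 0
    if not overlap:
        # Each partition is loaded, then computed, with no overlap.
        return n_partitions * (compute_time + memory_time)
    # With overlap, the first load cannot be hidden. After that, loading
    # partition i+1 runs in parallel with computing partition i, so each
    # stage costs the larger of the two times. The last partition only
    # needs its compute time.
    stage = max(compute_time, memory_time)
    return memory_time + (n_partitions - 1) * stage + compute_time


def latency_hidden(compute_time, memory_time):
    """In this model, steady-state memory time is fully hidden
    when it does not exceed the compute time of a partition."""
    return memory_time <= compute_time


if __name__ == "__main__":
    print(schedule_length(4, 10, 6, overlap=False))
    print(schedule_length(4, 10, 6, overlap=True))
    print(latency_hidden(10, 6))
```

Under these assumptions, choosing a partition size large enough that its compute time covers its memory-transfer time is what lets the memory latency disappear from the steady-state schedule. The limited SPM capacity, which this sketch does not model, bounds how large a partition can be.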
Related resources
Minimization of Memory Access Overhead for Multi-dimensional DSP Applications via Multi-level Partitioning and Scheduling
Massive uniform nested loops are broadly used in multi-dimensional DSP applications. Due to the large amount of data handled by such applications, the optimization of data accesses by fully utilizing the local memory and minimizing communication overhead is important in order to improve the overall system performance. Most of the traditional partition strategies do not consider the effect of dat...
Optimal loop scheduling for hiding memory latency based on two-level partitioning and prefetching
The large latency of memory accesses in modern computers is a key obstacle in achieving high processor utilization. As a result, a variety of techniques have been devised to hide this latency. These techniques range from cache hierarchies to various prefetching and memory management techniques for manipulating the data present in the caches. In DSP applications, the existence of large numbers o...
Multi-level partitioning and scheduling under local memory constraint
Massive uniform nested loops are broadly used in scientific and DSP applications. Due to the large amount of data handled by such applications, the optimization of data accesses by fully utilizing the local memory and minimizing communication overhead is important in order to improve the overall system performance. Most of the traditional partition strategies do not consider the effect of data ac...
The Effects of Architecture on the Performance of Latency
We study the effects of cache organization, caching policy and network capacity on the performance of latency hiding via fast context switching in large-scale shared memory multiprocessors. We describe a technique that supports hardware or software-initiated switches that works on a commercially available processor with register windows. Significant performance improvements (120%) can be achieved...
The Effects of Architecture on the Performance on Latency Hiding Via Rapid Context Switching
We study the effects of cache organization, caching policy and network capacity on the performance of latency hiding via fast context switching in large-scale shared memory multiprocessors. We describe a technique that supports hardware or software-initiated switches that works on a commercially available processor with register windows. Significant performance improvements can be achieved with la...
Journal title:
- Journal of Systems Architecture - Embedded Systems Design
Volume 61, Issue
Pages -
Publication date 2015